Add SLoC (Source Lines of Code) metric to versions #11453
This commit adds a new `JSONB` column called `linecounts` to the versions table to store Source Lines of Code statistics for each crate version. The column stores language breakdown and totals as structured `JSON` data, enabling flexible schema evolution without requiring additional migrations. The database schema and test snapshots are updated accordingly to reflect this new column structure.
This introduces a new workspace crate that provides line counting functionality using `tokei`. The crate includes `LinecountStats` and `LanguageStats` data structures for storing results, along with core analysis functions for processing file contents. The implementation includes language filtering to exclude non-programming files and path filtering to skip test and example directories. Comprehensive test coverage is provided with `insta` snapshots to ensure reliable functionality. This crate provides the foundation for adding SLOC metrics to crates.io by offering a clean, testable interface for analyzing source code statistics.
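The path filtering and per-language accumulation described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the field names, the skipped directory list, and the naive `//` comment detection are all assumptions, and the actual crate delegates counting to `tokei`.

```rust
use std::collections::BTreeMap;
use std::path::Path;

/// Per-language counts. Field names are assumptions for illustration.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LanguageStats {
    pub code_lines: usize,
    pub comment_lines: usize,
    pub files: usize,
}

/// Totals plus a per-language breakdown. Field names are assumptions.
#[derive(Debug, Default, Clone, PartialEq, Eq)]
pub struct LinecountStats {
    pub languages: BTreeMap<String, LanguageStats>,
    pub total_code_lines: usize,
    pub total_comment_lines: usize,
}

/// Returns true if any parent directory of `path` is in the skip list.
/// The directory names here are hypothetical.
pub fn should_skip_path(path: &Path) -> bool {
    const SKIPPED_DIRS: [&str; 3] = ["tests", "test", "examples"];
    path.parent().map_or(false, |dir| {
        dir.components().any(|c| {
            c.as_os_str()
                .to_str()
                .map_or(false, |name| SKIPPED_DIRS.contains(&name))
        })
    })
}

impl LinecountStats {
    /// Adds one file's counts. This naive counter only recognizes `//`
    /// line comments; it is a stand-in for what `tokei` does properly.
    pub fn add_file(&mut self, language: &str, contents: &str) {
        let mut code = 0usize;
        let mut comments = 0usize;
        for line in contents.lines() {
            let trimmed = line.trim();
            if trimmed.is_empty() {
                continue; // blank lines count as neither code nor comments
            }
            if trimmed.starts_with("//") {
                comments += 1;
            } else {
                code += 1;
            }
        }
        let entry = self.languages.entry(language.to_string()).or_default();
        entry.code_lines += code;
        entry.comment_lines += comments;
        entry.files += 1;
        self.total_code_lines += code;
        self.total_comment_lines += comments;
    }
}
```

A caller would check `should_skip_path` for each tarball entry and call `add_file` only for the files that pass.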
This adds the `linecounts` field to both the `Version` struct and the `NewVersion` builder. The field stores linecount data as `JSON`, following the established pattern for flexible schema evolution without requiring additional migrations. The `linecounts` field is optional to handle existing versions that don't have this data, and will be populated for new versions during the publish process. This design ensures backward compatibility while enabling rich source code metrics for future crate versions.
This enhances the tarball processing pipeline to include SLOC analysis by adding the `crates_io_linecount` dependency to the tarball processing crate and extending the `TarballInfo` struct with a `linecount_stats` field. The integration occurs during tarball file processing, where each qualifying source file is analyzed and its statistics are accumulated. All tarball processing test snapshots are updated to include linecount data, demonstrating the feature works correctly across various crate structures. The integration preserves existing functionality while adding minimal overhead to the tarball validation and processing pipeline.
This modifies the publish endpoint to store linecount statistics: it takes the linecount data from the tarball processing results, serializes the stats to `JSON`, and passes the result to the `NewVersion` builder for persistence. All publish-related test snapshots are updated to include linecount data, demonstrating that the integration works correctly across various publish scenarios. The implementation maintains backward compatibility with null linecount values for any edge cases.
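As a rough sketch of that flow, assuming hypothetical type and key names: stats are turned into a JSON string when present and left as `None` (stored as null) otherwise. The hand-written `format!` call is only there to keep the example dependency-free; real code would more likely use a serialization library.

```rust
/// Hypothetical totals taken from the tarball processing result.
pub struct Totals {
    pub total_code_lines: u64,
    pub total_comment_lines: u64,
}

/// Hypothetical stand-in for the input to the `NewVersion` builder.
#[derive(Debug, Default)]
pub struct NewVersionSketch {
    pub num: String,
    /// JSON text for the `linecounts` column; `None` maps to null.
    pub linecounts: Option<String>,
}

/// Serializes the totals to JSON if present. Key names are assumptions.
pub fn linecounts_json(stats: Option<&Totals>) -> Option<String> {
    stats.map(|s| {
        format!(
            "{{\"total_code_lines\":{},\"total_comment_lines\":{}}}",
            s.total_code_lines, s.total_comment_lines
        )
    })
}

/// Builds the version record, passing the optional JSON through unchanged.
pub fn build_version(num: &str, stats: Option<&Totals>) -> NewVersionSketch {
    NewVersionSketch {
        num: num.to_string(),
        linecounts: linecounts_json(stats),
    }
}
```

Keeping the field as an `Option` means a version without stats needs no special handling beyond passing `None`.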
No specific concerns on the implementation here.
I do wonder a little if we want to do this synchronously during publish. I don't expect line count calculation to be terribly expensive in the common case, but to defend against potential pathological cases, I might feel better about this if it was a background job. Not sure if anyone else on @rust-lang/crates-io has strong feelings here.
This PR introduces basic source code analysis for newly published versions. A new `crates_io_linecount` workspace crate uses the `tokei` crate to analyze source files during the publish process. The system collects language breakdowns and line count statistics, storing them as JSON in a new `linecounts` column on the `versions` table.
The analysis runs during tarball processing and excludes test directories and non-programming files. All existing functionality remains unchanged, with the new column being optional for backward compatibility.
Note that this is only the first step in a series of pull requests. The follow-up PRs will: